Author: Willian Pina
This dataset lists every shooting incident recorded in New York City from 2006 through the end of the previous calendar year. It is updated quarterly and reviewed by the NYPD’s Office of Management Analysis and Planning before being made available to the public. Each record includes details about the incident such as the date, time, location, and demographic information about suspects and victims. This dataset serves as a valuable tool for analyzing the nature of criminal and shooting activity in NYC.
This project is part of the Master’s program in Data Science at the University of Colorado Boulder, taught by Professor Dr. Jane Wall, within the course “Data Science as a Field”.
| Column | Description |
|---|---|
| INCIDENT_KEY | Randomly generated persistent ID for each incident |
| OCCUR_DATE | Exact date of the shooting incident |
| OCCUR_TIME | Exact time of the shooting incident |
| BORO | Borough where the incident occurred |
| LOC_OF_OCCUR_DESC | Whether the incident occurred inside or outside (values such as INSIDE, OUTSIDE) |
| PRECINCT | Precinct where the incident occurred |
| JURISDICTION_CODE | Jurisdiction where the incident occurred (0 = Patrol, 1 = Transit, 2 = Housing) |
| LOC_CLASSFCTN_DESC | Description of the location classification |
| LOCATION_DESC | Description of the incident location |
| STATISTICAL_MURDER_FLAG | Indicates whether the shooting resulted in a victim’s death, counted as a murder |
| PERP_AGE_GROUP | Age category of the perpetrator |
| PERP_SEX | Sex of the perpetrator |
| PERP_RACE | Race of the perpetrator |
| VIC_AGE_GROUP | Age category of the victim |
| VIC_SEX | Sex of the victim |
| VIC_RACE | Race of the victim |
| X_COORD_CD | Midblock X-coordinate for the New York State Plane Coordinate System |
| Y_COORD_CD | Midblock Y-coordinate for the New York State Plane Coordinate System |
| Latitude | Latitude coordinate for the global coordinate system |
| Longitude | Longitude coordinate for the global coordinate system |
| Lon_Lat | Longitude and latitude coordinates for mapping |
For more information and access to the data, visit the dataset link: NYPD Shooting Incident Data (Historic).
We import the data from the URL provided by the dataset source to begin the analysis.
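# Packages used throughout the report (assumed to be installed)
library(tidyverse)  # dplyr, tidyr, ggplot2
library(lubridate)  # mdy(), hms(), hour(), floor_date()
URL = "https://data.cityofnewyork.us/api/views/833y-fsy8/rows.csv?accessType=DOWNLOAD"
data = read.csv(URL)
head(data)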
## INCIDENT_KEY OCCUR_DATE OCCUR_TIME BORO LOC_OF_OCCUR_DESC PRECINCT
## 1 244608249 05/05/2022 00:10:00 MANHATTAN INSIDE 14
## 2 247542571 07/04/2022 22:20:00 BRONX OUTSIDE 48
## 3 84967535 05/27/2012 19:35:00 QUEENS 103
## 4 202853370 09/24/2019 21:00:00 BRONX 42
## 5 27078636 02/25/2007 21:00:00 BROOKLYN 83
## 6 230311078 07/01/2021 23:07:00 MANHATTAN 23
## JURISDICTION_CODE LOC_CLASSFCTN_DESC LOCATION_DESC
## 1 0 COMMERCIAL VIDEO STORE
## 2 0 STREET (null)
## 3 0
## 4 0
## 5 0
## 6 2 MULTI DWELL - PUBLIC HOUS
## STATISTICAL_MURDER_FLAG PERP_AGE_GROUP PERP_SEX PERP_RACE VIC_AGE_GROUP
## 1 true 25-44 M BLACK 25-44
## 2 true (null) (null) (null) 18-24
## 3 false 18-24
## 4 false 25-44 M UNKNOWN 25-44
## 5 false 25-44 M BLACK 25-44
## 6 false 25-44
## VIC_SEX VIC_RACE X_COORD_CD Y_COORD_CD Latitude Longitude
## 1 M BLACK 986050 214231.0 40.75469 -73.99350
## 2 M BLACK 1016802 250581.0 40.85440 -73.88233
## 3 M BLACK 1048632 198262.0 40.71063 -73.76777
## 4 M BLACK 1014493 242565.0 40.83242 -73.89071
## 5 M BLACK 1009149 190104.7 40.68844 -73.91022
## 6 M BLACK 999061 229912.0 40.79773 -73.94651
## Lon_Lat
## 1 POINT (-73.9935 40.754692)
## 2 POINT (-73.88233 40.854402)
## 3 POINT (-73.76777349199995 40.71063412500007)
## 4 POINT (-73.89071440599997 40.832416753000075)
## 5 POINT (-73.91021857399994 40.68844345900004)
## 6 POINT (-73.94650786199998 40.79772716600007)
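The preview shows empty strings and literal "(null)" markers in several text columns. As a minimal sketch, assuming one wants these treated as missing at import time, the `na.strings` argument of `read.csv` can map them to `NA`. The inline CSV below is a hypothetical three-row sample, not the real file, and `sample_data` is an illustrative name.

```r
# Hypothetical sample mimicking the placeholder values seen in the preview
csv_text <- "INCIDENT_KEY,BORO,PERP_SEX,LOCATION_DESC
1,MANHATTAN,M,VIDEO STORE
2,BRONX,(null),(null)
3,QUEENS,,"

# Treat empty strings and the literal "(null)" as NA while reading
sample_data <- read.csv(text = csv_text, na.strings = c("", "(null)"))

# Count missing values per column
na_counts <- colSums(is.na(sample_data))
print(na_counts)
```

Whether placeholders such as "UNKNOWN" should also count as missing is a separate judgment call and is deliberately left out of this sketch.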
Let’s start by analyzing and preparing the data from the NYPD Shooting Incident Dataset. We first summarize every column; based on this initial analysis, we can decide how to handle any missing values.
# Summary data
summary(data)
## INCIDENT_KEY OCCUR_DATE OCCUR_TIME BORO
## Min. : 9953245 Length:28562 Length:28562 Length:28562
## 1st Qu.: 65439914 Class :character Class :character Class :character
## Median : 92711254 Mode :character Mode :character Mode :character
## Mean :127405824
## 3rd Qu.:203131993
## Max. :279758069
##
## LOC_OF_OCCUR_DESC PRECINCT JURISDICTION_CODE LOC_CLASSFCTN_DESC
## Length:28562 Min. : 1.0 Min. :0.0000 Length:28562
## Class :character 1st Qu.: 44.0 1st Qu.:0.0000 Class :character
## Mode :character Median : 67.0 Median :0.0000 Mode :character
## Mean : 65.5 Mean :0.3219
## 3rd Qu.: 81.0 3rd Qu.:0.0000
## Max. :123.0 Max. :2.0000
## NA's :2
## LOCATION_DESC STATISTICAL_MURDER_FLAG PERP_AGE_GROUP
## Length:28562 Length:28562 Length:28562
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## PERP_SEX PERP_RACE VIC_AGE_GROUP VIC_SEX
## Length:28562 Length:28562 Length:28562 Length:28562
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## VIC_RACE X_COORD_CD Y_COORD_CD Latitude
## Length:28562 Min. : 914928 Min. :125757 Min. :40.51
## Class :character 1st Qu.:1000068 1st Qu.:182912 1st Qu.:40.67
## Mode :character Median :1007772 Median :194901 Median :40.70
## Mean :1009424 Mean :208380 Mean :40.74
## 3rd Qu.:1016807 3rd Qu.:239814 3rd Qu.:40.82
## Max. :1066815 Max. :271128 Max. :40.91
## NA's :59
## Longitude Lon_Lat
## Min. :-74.25 Length:28562
## 1st Qu.:-73.94 Class :character
## Median :-73.92 Mode :character
## Mean :-73.91
## 3rd Qu.:-73.88
## Max. :-73.70
## NA's :59
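For character columns, `summary()` reports only length and class, so blank or "(null)" entries stay hidden in the output above. The following is a rough sketch, on a hypothetical data frame standing in for `data`, of counting such entries per column; the helper name `count_missing` is illustrative.

```r
# Hypothetical data frame standing in for `data`
df <- data.frame(
  PERP_SEX = c("M", "", "(null)", "F"),
  VIC_SEX = c("M", "M", "F", "M"),
  Latitude = c(40.75, NA, 40.71, 40.83),
  stringsAsFactors = FALSE
)

# Count entries that are NA, empty, or the literal "(null)"
count_missing <- function(x) {
  sum(is.na(x) | x %in% c("", "(null)"))
}

missing_per_column <- sapply(df, count_missing)
print(missing_per_column)
```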
# Converting date and time.
data$OCCUR_DATE = mdy(data$OCCUR_DATE)
data$OCCUR_TIME = hms(data$OCCUR_TIME)
# Converting variables to factor and logical types.
data$BORO = as.factor(data$BORO)
data$PERP_SEX = as.factor(data$PERP_SEX)
data$PERP_RACE = as.factor(data$PERP_RACE)
data$VIC_SEX = as.factor(data$VIC_SEX)
data$VIC_RACE = as.factor(data$VIC_RACE)
data$STATISTICAL_MURDER_FLAG = as.logical(data$STATISTICAL_MURDER_FLAG)
# Dropping incomplete rows, unused columns, and the malformed age group "1022"
data_clean <- data %>%
  filter(complete.cases(.)) %>%
  select(-c(X_COORD_CD, Y_COORD_CD, Lon_Lat)) %>%
  filter(VIC_AGE_GROUP != "1022", !is.na(VIC_AGE_GROUP))
As the summary shows, Latitude and Longitude each have 59 missing values and JURISDICTION_CODE has 2. Because these rows are a very small share of the 28,562 records, we remove them from the dataset. We also drop any record whose VIC_AGE_GROUP holds the malformed value "1022".
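To put "small" into numbers, the arithmetic can be sketched from the row counts printed in this report: 28,562 rows in the raw summary and 28,500 rows in the summary of the cleaned data. The variable names are illustrative.

```r
# Row counts taken from the two summary() outputs in this report
rows_before <- 28562
rows_after  <- 28500

rows_dropped <- rows_before - rows_after
pct_dropped  <- 100 * rows_dropped / rows_before

cat(sprintf("Dropped %d rows (%.2f%% of the original data)\n",
            rows_dropped, pct_dropped))
```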
Following these modifications, we will proceed to verify the absence of any remaining missing data and confirm the successful exclusion of the specified columns from the dataset.
# Summary data clean
summary(data_clean)
## INCIDENT_KEY OCCUR_DATE OCCUR_TIME
## Min. : 9953245 Min. :2006-01-01 Min. :0S
## 1st Qu.: 65274632 1st Qu.:2009-08-31 1st Qu.:3H 30M 0S
## Median : 92550364 Median :2013-09-09 Median :15H 14M 0S
## Mean :127113912 Mean :2014-05-31 Mean :12H 43M 53.7810526315807S
## 3rd Qu.:202504684 3rd Qu.:2019-09-15 3rd Qu.:20H 45M 0S
## Max. :279758069 Max. :2023-12-29 Max. :23H 59M 0S
##
## BORO LOC_OF_OCCUR_DESC PRECINCT JURISDICTION_CODE
## BRONX : 8363 Length:28500 Min. : 1.0 Min. :0.0000
## BROOKLYN :11331 Class :character 1st Qu.: 44.0 1st Qu.:0.0000
## MANHATTAN : 3744 Mode :character Median : 67.0 Median :0.0000
## QUEENS : 4262 Mean : 65.5 Mean :0.3225
## STATEN ISLAND: 800 3rd Qu.: 81.0 3rd Qu.:0.0000
## Max. :123.0 Max. :2.0000
##
## LOC_CLASSFCTN_DESC LOCATION_DESC STATISTICAL_MURDER_FLAG
## Length:28500 Length:28500 Mode :logical
## Class :character Class :character FALSE:22978
## Mode :character Mode :character TRUE :5522
##
##
##
##
## PERP_AGE_GROUP PERP_SEX PERP_RACE VIC_AGE_GROUP
## Length:28500 : 9310 BLACK :11879 Length:28500
## Class :character (null): 1115 : 9310 Class :character
## Mode :character F : 443 WHITE HISPANIC: 2502 Mode :character
## M :16133 UNKNOWN : 1837
## U : 1499 BLACK HISPANIC: 1388
## (null) : 1115
## (Other) : 469
## VIC_SEX VIC_RACE Latitude
## F: 2753 AMERICAN INDIAN/ALASKAN NATIVE: 11 Min. :40.51
## M:25735 ASIAN / PACIFIC ISLANDER : 440 1st Qu.:40.67
## U: 12 BLACK :20200 Median :40.70
## BLACK HISPANIC : 2787 Mean :40.74
## UNKNOWN : 70 3rd Qu.:40.82
## WHITE : 728 Max. :40.91
## WHITE HISPANIC : 4264
## Longitude
## Min. :-74.25
## 1st Qu.:-73.94
## Median :-73.92
## Mean :-73.91
## 3rd Qu.:-73.88
## Max. :-73.70
##
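Rather than reading the summary by eye, the same verification could be expressed as assertions. This is only a sketch on a hypothetical three-row stand-in for `data_clean`. Note that it checks for `NA` values only: the blank and "(null)" entries still visible in the PERP_SEX and PERP_RACE counts above are ordinary strings and would pass this check.

```r
# Hypothetical stand-in for data_clean
cleaned <- data.frame(
  INCIDENT_KEY = c(1, 2, 3),
  Latitude = c(40.75, 40.85, 40.71),
  Longitude = c(-73.99, -73.88, -73.77)
)

dropped_cols <- c("X_COORD_CD", "Y_COORD_CD", "Lon_Lat")

# TRUE when no NA remains anywhere in the data frame
no_missing <- !any(is.na(cleaned))
# TRUE when none of the dropped columns is still present
cols_removed <- !any(dropped_cols %in% names(cleaned))

# Stops with an error if either condition fails
stopifnot(no_missing, cols_removed)
```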
Next, we use visualizations to test a few hypotheses and extract insights.
Our dataset contains georeferenced information about shooting incidents in New York City, including the age group of the victims. We can use this data to plot the locations of these incidents on a map and segment them by the victims’ age groups. This analysis will allow us to identify whether there are specific areas of the city where certain victim age profiles appear more frequently. Thus, we can visually explore the geographic distribution of incidents and investigate potential patterns related to the age of the victims.
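# Get a basemap (get_stadiamap() is part of ggmap and requires a Stadia Maps API key,
# registered beforehand with register_stadiamaps())
library(ggmap)
# Note: the + 0.01 on `left` and `bottom` moves those edges inward, so points at the
# western and southern extremes can fall outside the map
map_data <- get_stadiamap(
  bbox = c(left = min(data_clean$Longitude) + 0.01,
           bottom = min(data_clean$Latitude) + 0.01,
           right = max(data_clean$Longitude) + 0.01,
           top = max(data_clean$Latitude) + 0.01),
  maptype = "stamen_toner_lite")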
# Create the chart
gg <- ggmap(map_data) +
  geom_point(data = data_clean, aes(x = Longitude, y = Latitude, color = VIC_AGE_GROUP), alpha = 0.5, size = 3) +
  scale_color_manual(values = c("18-24" = "blue", "25-44" = "red", "45-64" = "green", "65+" = "yellow", "<18" = "purple", "UNKNOWN" = "grey"),
                     name = "Age Range") +
  labs(title = "Map of Shooting Incidents by Victim Age Category",
       subtitle = "NYPD Shooting Incident Data",
       caption = "Source: NYPD Shooting Incident Data") +
  theme_minimal() +
  theme(plot.title = element_text(size = 16),
        plot.subtitle = element_text(size = 14),
        plot.caption = element_text(size = 12),
        legend.title = element_text(size = 14),
        legend.text = element_text(size = 12),
        axis.title = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank())
# Show the graph
print(gg)
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
The map shows no area dominated by a single victim age group. Across the city, however, victims aged 25 to 44 appear most often, as indicated by the predominant red color.
Additionally, Staten Island, the island in the southwest of the map, shows few incidents (800 of the 28,500 records in the cleaned data). It is the least populous of the five boroughs and has large park areas, which may explain the low number of reported incidents there.
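The plotting warning reports 4 points dropped from the map. One plausible cause, stated here as an assumption rather than a confirmed diagnosis, is that the bounding box adds 0.01 to its left and bottom edges and so excludes the westernmost and southernmost points. A minimal sketch of a box padded outward on all four sides follows; `padded_bbox` is a hypothetical helper, and the coordinates are illustrative values close to the ranges in the summary.

```r
# Build a bounding box that extends `pad` degrees beyond the data on every side
padded_bbox <- function(lon, lat, pad = 0.01) {
  c(left   = min(lon, na.rm = TRUE) - pad,
    bottom = min(lat, na.rm = TRUE) - pad,
    right  = max(lon, na.rm = TRUE) + pad,
    top    = max(lat, na.rm = TRUE) + pad)
}

# Hypothetical coordinates roughly spanning the ranges in the summary
lon <- c(-74.25, -73.92, -73.70)
lat <- c(40.51, 40.70, 40.91)

bbox <- padded_bbox(lon, lat)
print(bbox)

# Check that every point lies strictly inside the padded box
inside <- lon > bbox[["left"]] & lon < bbox[["right"]] &
  lat > bbox[["bottom"]] & lat < bbox[["top"]]
print(all(inside))
```

The resulting named vector has the same `left`, `bottom`, `right`, `top` layout as the `bbox` argument used for the basemap, so it could be passed there directly.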
To confirm the initial observation that most victims belong to the 25 to 44 age group, we count the incidents in each victim age category.
# Group data by victim's age category and count events
age_data <- data_clean %>%
  group_by(VIC_AGE_GROUP) %>%
  summarise(Count = n(), .groups = 'drop')
# Create the bar chart
gg <- ggplot(age_data, aes(x = VIC_AGE_GROUP, y = Count, fill = VIC_AGE_GROUP)) +
  geom_bar(stat = "identity", color = "black") +
  labs(title = "Number of Shooting Incidents by Victim Age Category",
       x = "Victim Age Category",
       y = "Number of Incidents",
       fill = "Age Category") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Plot the graph
print(gg)
The bar chart shows that victims aged 25 to 44 are the most affected group, followed by young people between 18 and 24 years old.
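Raw counts can be complemented by each group's share of all incidents. The sketch below uses a hypothetical vector in place of `data_clean$VIC_AGE_GROUP`, so its percentages are illustrative only and say nothing about the real distribution.

```r
# Hypothetical victim age groups standing in for data_clean$VIC_AGE_GROUP
vic_age <- c("25-44", "18-24", "25-44", "<18", "25-44", "18-24", "45-64", "25-44")

age_counts <- table(vic_age)

# Share of each group in percent, sorted from most to least frequent
age_share <- sort(round(100 * prop.table(age_counts), 1), decreasing = TRUE)
print(age_share)
```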
To see whether shooting activity over time differs by the perpetrator’s sex, we build a line graph of monthly incident counts.
This graph lets us compare trends and discrepancies between the male and female categories over the months.
# Keep only incidents where the perpetrator's sex is known, starting from the
# cleaned data and using a separate name so data_clean is not overwritten
perp_data <- data_clean %>%
  filter(PERP_SEX %in% c("M", "F"))
# Reorder the factors so that M is above F in the legend
perp_data$PERP_SEX <- factor(perp_data$PERP_SEX, levels = c("M", "F"))
# Group data by month/year and sex of perpetrator
timeline_data <- perp_data %>%
  group_by(Month = floor_date(OCCUR_DATE, "month"), PERP_SEX) %>%
  summarise(Count = n(), .groups = 'drop')
# Create the line chart
gg <- ggplot(timeline_data, aes(x = Month, y = Count, color = PERP_SEX, group = PERP_SEX)) +
  geom_line() +
  labs(title = "Monthly Timeline of Shooting Incidents by Perpetrator's Sex",
       x = "Date",
       y = "Number of Incidents",
       color = "Perpetrator's Sex") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Show the graph
print(gg)
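The monthly grouping behind this chart can be illustrated on a handful of hypothetical dates. The sketch uses base R only, with `format()` standing in for `floor_date()`, and the `incidents` data frame is made up for illustration.

```r
# Hypothetical incident dates and perpetrator sex
incidents <- data.frame(
  OCCUR_DATE = as.Date(c("2021-07-01", "2021-07-15", "2021-07-20", "2021-08-03")),
  PERP_SEX = c("M", "M", "F", "M")
)

# Truncate each date to the first day of its month
incidents$Month <- as.Date(format(incidents$OCCUR_DATE, "%Y-%m-01"))

# Count incidents per month and sex
monthly <- aggregate(Count ~ Month + PERP_SEX,
                     data = transform(incidents, Count = 1),
                     FUN = sum)
print(monthly)
```

Month and sex combinations without any incident are absent from the result rather than recorded as zero, so a line chart built on such a table connects the neighboring months directly.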
Over the period considered, we observed some interesting points:
Our dataset includes the OCCUR_TIME column, which records the exact time of each incident. Using this information, we can analyze and identify the time periods during which incidents are most frequent in New York City.
This analysis will allow us to better understand the temporal patterns of the incidents and potentially direct prevention and response measures more effectively.
library(plotly)
# OCCUR_TIME was already converted with hms() during cleaning, so only the hour is extracted
data_clean$Hour <- hour(data_clean$OCCUR_TIME)
# Ensure that all hours are represented even if there are no incidents
full_hours <- data.frame(Hour = 0:23)
hourly_data <- full_hours %>%
  left_join(data_clean %>% group_by(Hour) %>% summarise(Incidents = n(), .groups = 'drop'), by = "Hour") %>%
  replace_na(list(Incidents = 0))
# Generate labels for hours
hourly_data$HourLabel <- sprintf("%02d:00", hourly_data$Hour)
# Create the radial graph
fig <- plot_ly(
  data = hourly_data,
  type = 'scatterpolar',
  mode = 'lines+markers',
  r = hourly_data$Incidents,
  theta = hourly_data$HourLabel,
  fill = 'toself',
  line = list(color = 'blue')
) %>%
  layout(
    polar = list(
      radialaxis = list(
        visible = TRUE,
        range = c(0, max(hourly_data$Incidents) + 10)
      ),
      angularaxis = list(
        direction = "clockwise", # Set to clockwise
        rotation = 90,
        type = 'category',
        showline = FALSE,
        tickmode = 'array',
        tickvals = hourly_data$Hour,
        ticktext = hourly_data$HourLabel
      )
    ),
    title = "Number of Incidents during the Day",
    margin = list(t = 100)
  )
# Show the graph
fig
The radial graph of incidents by hour shows that the period between 21:00 and 23:00 has the highest number of incidents. Conversely, the hours between 05:00 and 13:00 show a marked reduction in the number of events.
Interestingly, the number of incidents starts to climb at 18:00. One possible explanation is that people are returning home or going out for evening activities after the end of the workday, which increases their exposure to potential incidents.
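The peak hour can also be read off programmatically instead of from the chart. A minimal base R sketch follows, using a hypothetical vector of hours in place of `data_clean$Hour`; the resulting peak reflects the made-up data, not the real dataset.

```r
# Hypothetical hours of occurrence (0-23) standing in for data_clean$Hour
hours <- c(21, 22, 22, 23, 23, 23, 23, 0, 1, 14)

# A factor with all 24 levels keeps hours without incidents as zero counts
hour_counts <- table(factor(hours, levels = 0:23))

# Hour with the largest count (the first one if several hours tie)
peak_hour <- as.integer(names(hour_counts)[which.max(hour_counts)])
print(peak_hour)
```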
Exploring the dataset on shooting incidents in New York City, we identify it as a valuable tool to help the government shape security policies. It includes variables such as race, sex, and location (borough and precinct), which could be analyzed in depth to understand the dynamics of security across different regions. A pertinent question would be whether more affluent neighborhoods record shootings at the same rate as less privileged areas. The answer could inspire specific policies to balance this distribution.
Moreover, analyzing gender and race in the incidents could open a dialogue about potential biases in these categories, but it is crucial to maintain a clear focus to avoid deviations from the initial objective of the analysis. Variables of social class add another layer of complexity and should be approached with a defined purpose to prevent divergent debates.
An interesting point is the predominance of shootings at night, which raises the hypothesis that they are facilitated by a reduced police presence on patrol. The dataset contains no information on patrols, so this hypothesis cannot be tested here. Moreover, considering that the New York Police Department is known to be well equipped and trained, this factor may be less influential than initially presumed.
It is also important to highlight that the dataset is reviewed by the Office of Management Analysis and Planning, which can influence how data is presented. This review can either intensify or soften certain aspects of the data, potentially creating an analytical bias that favors interpretations aligned with political interests, especially in a context where one political party has dominated for years.
These considerations underline the need for careful and objective analysis, always seeking clarity in objectives so that conclusions are based on robust evidence and not on premises influenced by potential political or social biases.